---
title: Reading JSON Data from HDFS
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

Use the PXF HDFS Connector to read JSON-format data. This section describes how to use PXF to access JSON data in HDFS, including how to create and query an external table that references a JSON file in the HDFS data store.

## <a id="prereq"></a>Prerequisites

Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from HDFS.

## <a id="hdfsjson_work"></a>Working with JSON Data

JSON is a text-based data-interchange format. JSON data is typically stored in a file with a `.json` suffix. 

A `.json` file will contain a collection of objects. A JSON object is a collection of unordered name/value pairs. A value can be a string, a number, true, false, null, or an object or an array. You can define nested JSON objects and arrays.

Sample JSON data file content:

``` json
  {
    "created_at":"MonSep3004:04:53+00002013",
    "id_str":"384529256681725952",
    "user": {
      "id":31424214,
      "location":"COLUMBUS"
    },
    "coordinates":{
      "type":"Point",
      "values":[
         13,
         99
      ]
    }
  }
```

In the sample above, `user` is an object composed of fields named `id` and `location`. To specify the nested fields in the `user` object as Greenplum Database external table columns, use `.` projection:

``` pre
user.id
user.location
```

`coordinates` is an object composed of a text field named `type` and an array of integers named `values`. 

To retrieve all of the elements of the `values` array in a single column, define the corresponding Greenplum Database external table column as type `TEXT[]`.

``` pre
"coordinates.values" TEXT[]
```

Refer to [Introducing JSON](http://www.json.org/) for detailed information on JSON syntax.

### <a id="datatypemap_json"></a>JSON to Greenplum Database Data Type Mapping</a>

To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON supports complex data types including projections and arrays. Use N-level projection to map members of nested objects and arrays to primitive data types.

The following table summarizes external mapping rules for JSON data.

<caption><span class="tablecap">Table 1. JSON Mapping</span></caption>

<a id="topic_table_jsondatamap"></a>

| JSON Data Type                                                    | PXF/Greenplum Data Type                                                                                                                                                                                            |
|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Primitive type (integer, float, string, boolean, null) | Use the corresponding Greenplum Database built-in data type; see [Greenplum Database Data Types](https://gpdb.docs.pivotal.io/latest/ref_guide/data_types.html). |
| Array                         | Use `TEXT[]` to retrieve the JSON array as a Greenplum text array.                                                                                           |
| Object                | Use dot `.` notation to specify each level of projection (nesting) to a member of a primitive or Array type.                                                                                         |

### <a id="topic_jsonreadmodes"></a>JSON Data Read Modes

PXF supports two data read modes. The default mode expects one full JSON record per line.  PXF also supports a read mode operating on JSON records that span multiple lines.

In upcoming examples, you will use both read modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:

   - "created_at" - text
   - "id_str" - text
   - "user" - object
      - "id" - integer
      - "location" - text
   - "coordinates" - object (optional)
      - "type" - text
      - "values" - array
         - [0] - integer
         - [1] - integer


The single-JSON-record-per-line data set follows:

``` pre
{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values": [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{"id":287819058,"location":""}, "coordinates": null}
```

This is the data set for the multi-line JSON record data set:

``` json
{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      },
      "record_obj":{
        "created_at":"MonSep3004:04:54+00002013",
        "id_str":"384529260872228864",
        "user":{
          "id":67600981,
          "location":"KryberWorld"
        },
        "coordinates":{
          "type":"Point",
          "values":[
             8,
             52
          ]
        }
      }
    }
  ]
}
```

You will create JSON files for the sample data sets and add them to HDFS in the next section.

## <a id="jsontohdfs"></a>Loading the Sample JSON Data to HDFS

The PXF HDFS connector reads native JSON stored in HDFS. Before you can use Greenplum Database to query JSON format data, the data must reside in your HDFS data store.

Copy and paste the single line JSON record sample data set above to a file named `singleline.json`. Similarly, copy and paste the multi-line JSON record data set to a file named `multiline.json`.

**Note**: Ensure that there are **no** blank lines in your JSON files.

Copy the JSON data files that you just created to your HDFS data store. Create the `/data/pxf_examples` directory if you did not do so in a previous exercise. For example:

``` shell
$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples
```

Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and analyze the JSON data.


## <a id="json_cet"></a>Creating the External Table

Use the `hdfs:json` profile to read JSON-format files from HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:

``` sql
CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL TABLE](https://gpdb.docs.pivotal.io/latest/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.

| Keyword  | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;file\>    | The path to the directory or file in the HDFS data store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path&#8209;to&#8209;hdfs&#8209;file\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path&#8209;to&#8209;hdfs&#8209;file\> must not specify a relative path nor include the dollar sign (`$`) character. |
| PROFILE    | The `PROFILE` keyword must specify `hdfs:json`. |
| SERVER=\<server_name\>    | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. |
| \<custom&#8209;option\>  | \<custom-option\>s are discussed below.|
| FORMAT 'CUSTOM' | Use `FORMAT` `'CUSTOM'` with  the `hdfs:json` profile. The `CUSTOM` `FORMAT` requires that you specify `(FORMATTER='pxfwritable_import')`. |

<a id="customopts"></a>
PXF supports single- and multi- line JSON records. When you want to read multi-line JSON records, you must provide an `IDENTIFIER` \<custom-option\> and value. Use this \<custom-option\> to identify the name of a field whose parent JSON object you want to be returned as individual tuples.

The `hdfs:json` profile supports the following \<custom-option\>s:

| Option Keyword  | &nbsp;&nbsp;Syntax,&nbsp;&nbsp;Example(s)&nbsp;&nbsp; | Description |
|-------|--------------|-----------------------|
| IDENTIFIER  | `&IDENTIFIER=<value>`<br>`&IDENTIFIER=created_at`| You must include the `IDENTIFIER` keyword and \<value\> in the `LOCATION` string only when you are accessing JSON data comprised of multi-line records. Use the \<value\> to identify the name of the field whose parent JSON object you want to be returned as individual tuples. | 
| IGNORE_MISSING_PATH | `&IGNORE_MISSING_PATH=<boolean>` | Specify the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`, PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |

<div class="note">When a nested object in a multi-line record JSON file includes a field with the same name as that of a parent object field <i>and</i> the field name is also specified as the <code>IDENTIFIER</code>, there is a possibility that PXF could return incorrect results. Should you need to, you can work around this edge case by compressing the JSON file, and having PXF read the compressed file.</div>


## <a id="jsonexample1"></a>Example: Reading a JSON File with Single Line Records

Use the following [CREATE EXTERNAL TABLE](https://gpdb.docs.pivotal.io/latest/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) SQL command to create a readable external table that references the single-line-per-record JSON data file and uses the PXF default server.

```sql
CREATE EXTERNAL TABLE singleline_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Notice the use of `.` projection to access the nested fields in the `user` and `coordinates` objects.

To query the JSON data in the external table:

``` sql
SELECT * FROM singleline_json_tbl;
```

To access specific elements of the `coordinates.values` array, you can specify the array subscript number in square brackets:

```sql
SELECT "coordinates.values"[1], "coordinates.values"[2] FROM singleline_json_tbl;
```

To access the array elements as some type other than `TEXT`, you can either cast the whole column:

```sql
SELECT "coordinates.values"::int[] FROM singleline_json_tbl;
```

or cast specific elements:

```sql
SELECT "coordinates.values"[1]::int, "coordinates.values"[2]::float FROM singleline_json_tbl;
```


## <a id="jsonexample2"></a>Example: Reading a JSON file with Multi-Line Records

The SQL command to create a readable external table from the multi-line-per-record JSON file is very similar to that of the single line data set above. You must additionally specify the `LOCATION` clause `IDENTIFIER` keyword and an associated value when you want to read multi-line JSON records. For example:

``` sql
CREATE EXTERNAL TABLE multiline_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

`created_at` identifies the member name of the first field in the JSON record `record_obj` in the sample data schema.

To query the JSON data in this external table:

``` sql
SELECT * FROM multiline_json_tbl;
```

## <a id="array_other"></a> Other Methods to Read a JSON Array

Starting in version 6.2.0, PXF supports reading a JSON array into a `TEXT[]` column. PXF still supports the old methods of using array element projection or a single text-type column to read a JSON array. These access methods are described here.

### <a id="arrayelproj"></a>Using Array Element Projection

PXF supports accessing specific elements of a JSON array using the syntax `[n]` in the table definition to identify the specific element.

```sql
CREATE EXTERNAL TABLE singleline_json_tbl_aep(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values[0]" INTEGER,
  "coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

**Note**: When you use this method to identify specific array elements, PXF provides _only_ those values to Greenplum Database, not the whole JSON array.

If your existing external table definition uses array element projection and you want to read the array into a `TEXT[]` column, you can use the `ALTER EXTERNAL TABLE` command to update the table definition. For example:

```sql
ALTER EXTERNAL TABLE singleline_json_tbl_aep DROP COLUMN "coordinates.values[0]", DROP COLUMN "coordinates.values[1]", ADD COLUMN "coordinates.values" TEXT[];
```

If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the changes to column name and type.

### <a id="singletextcol"></a> Specifying a Single Text-type Column

PXF supports accessing all of the elements within an array as a single string containing the serialized JSON array by defining the corresponding Greenplum table column with one of the following data types: `TEXT`, `VARCHAR`, or `BPCHAR`.

```sql
CREATE EXTERNAL TABLE singleline_json_tbl_stc(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

If you retrieve the JSON array in a single text-type column and wish to convert the JSON array serialized as `TEXT` back into a native Greenplum array type, you can use the example query below:

```sql
SELECT user.id,
       ARRAY(SELECT json_array_elements_text(coordinates.values::json))::int[] AS coords
FROM singleline_json_tbl_stc;
```

**Note**: This conversion is possible only when you are using PXF with Greenplum Database 6.x; the function `json_array_elements_text()` is not available in Greenplum 5.x.

If your external table definition uses a single text-type column for a JSON array and you want to read the array into a `TEXT[]` column, you can use the `ALTER EXTERNAL TABLE` command to update the table definition. For example:

```sql
ALTER EXTERNAL TABLE singleline_json_tbl_stc ALTER COLUMN "coordinates.values" TYPE TEXT[];
```

If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the change in column type.

